This document provides initial scaffolding for a conceptual critique of cross-sectional network approaches to psychopathology symptom data. In particular, we articulate the view that structural equation modeling (SEM) provides a statistical formalism that can accommodate a range of covariance structures, largely encompassing networks and also allowing one to examine alternative representations of latent structure. To highlight that nodal centrality metrics largely recapitulate free parameters in SEM, we conduct a series of simulations of latent variable models analyzed using a network approach, as well as simulated networks analyzed through latent variable models. We highlight conceptual and quantitative limitations of the network approach for resolving the latent structure of psychopathology and argue that principled model selection and comparison in SEM provides a strong basis for resolving alternative accounts of how symptoms covary with each other.
[[A point of some convergence is the recent Epskamp Psychometrika paper that blends latent variables and networks. This allows one to consider measurement error, to measure/test a hypothetical construct, and to compare model fit of a GGM against a SEM directly (i.e., same likelihood function). I suggest we include a summary of this later in the paper as a rapprochement and to highlight this as a promising direction.]]
One of the most difficult challenges faced by clinicians and researchers alike is to conceptualize the co-occurrence of symptoms of psychopathology. Do certain symptoms co-occur because they reflect an underlying clinical entity or a shared latent trait? In a given individual, how should symptom co-occurrence inform treatment targets? Developing within-person models of symptom dynamics is an important research topic for clinical psychological science that may help to unpack the causal relationships among symptoms (Wright et al., 2016). For example, knowing that a given patient often becomes suicidal after a marital conflict may lead a clinician to make a referral for couples therapy or to work with the individual on interpersonal skills. At a between-persons level, anhedonia often precedes the development of suicidal ideation (Winer et al., 2014) and thus, on average, it may be beneficial to treat anhedonia before suicidal behavior develops.
Motivated by such questions, there has been a surge of interest in network analyses of psychopathology that could potentially reveal dynamic or causal relationships among symptoms (Borsboom et al., 2016; Hofmann, Curtiss, & McNally, 2016). More specifically, in cross-sectional samples, many research groups have used metrics derived from graph theory to explore whether particular symptoms are central in a network derived from psychometric indicators of one or more forms of psychopathology. Centrality within a symptom co-occurrence network, it has been argued, may help to identify important clinical targets that play a crucial role in precipitating other problems or that may be particularly salient indicators of a given disorder.
Although the motivation to understand dynamic relationships among features of psychopathology is well-founded, in this critical review we highlight three important problems with applying network approaches to cross-sectional symptom data. First, the assumptions that network analyses (e.g., graphical lasso models) make about the generative processes underlying the data often go unstated and may not align with many scientific questions. For example, networks based on partial correlations (i.e., where symptoms A and B are related after conditioning on all other symptoms in the network) discard common covariation that may reflect a shared liability factor, a plausible and parsimonious explanation for symptom co-occurrence (Krueger & Markon, 2006). Second, because cross-sectional networks represent only between-persons variability, they are ill-suited for drawing conclusions about dynamic (i.e., temporally unfolding) relationships among symptoms at a within-person level. Finally, network analyses of cross-sectional symptom covariance matrices do not provide solid ground for making ontological inferences about the latent structure of psychopathology and therefore have limited utility for advancing clinical theory and nomenclature.
Using a series of latent structure simulations, both network and latent trait, we demonstrate how network analyses provide descriptive accounts of latent structure that can be corroborated and tested more precisely using common latent variable analyses (e.g., latent trait models). We also demonstrate how network analyses are often vulnerable to making spurious inferences about the relationships among symptoms. Finally, we articulate how network analyses can be useful in guiding exploratory data analysis prior to testing formal models of latent structure.
In graph analyses of psychometric data, the Gaussian graphical model (GGM) is the most common formal basis for network definition (Epskamp et al., in press, Psychometrika). Importantly, SEM is based on a model of the covariance matrix, whereas the GGM is based on the inverse covariance matrix (precision matrix), whose off-diagonal elements measure the partial correlation between two indicators net of all other indicators. This has important implications for the assumed data-generating process (e.g., that there is a unique relationship between two manifest indicators) and for the representation of sparsity or constraint.
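To make the covariance-versus-precision distinction concrete, the following sketch (Python/NumPy rather than the R used elsewhere in this manuscript; the loadings are hypothetical) derives partial correlations from the inverse of a one-factor implied correlation matrix. Conditioning on the remaining indicator shrinks the association between the first two, because part of their marginal correlation is carried by the shared factor:

```python
import numpy as np

# Hypothetical loadings: three indicators of a single latent factor
lam = np.array([0.8, 0.7, 0.6])
S = np.outer(lam, lam)           # model-implied correlations (off-diagonal)
np.fill_diagonal(S, 1.0)

Theta = np.linalg.inv(S)         # precision matrix: the basis of the GGM
d = np.sqrt(np.diag(Theta))
P = -Theta / np.outer(d, d)      # partial correlations, net all other indicators
np.fill_diagonal(P, 1.0)
```

Here the marginal correlation between the first two indicators is .56, while their partial correlation is smaller (about .45), illustrating how the precision-matrix representation discards covariation attributable to a common source.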
Under traditional SEM, sparsity is largely a function of the researcher’s beliefs about the structure of the data. To the extent that a covariance matrix can be reasonably represented by fewer than \(\frac{p(p+1)}{2}\) parameters, the model is overidentified (i.e., has positive degrees of freedom). Thus, a common objective in SEM is to specify a model that is simpler than the saturated model but nevertheless represents the data well.
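As a worked example of this counting rule (our illustrative sketch, not code from the simulations below): a one-factor model with p indicators and the factor variance fixed to 1 estimates p loadings and p residual variances, so

```python
def cfa_df(p):
    """Degrees of freedom for a one-factor CFA with p indicators
    (factor variance fixed to 1: p loadings + p residual variances)."""
    unique_moments = p * (p + 1) // 2   # unique elements of the covariance matrix
    free_params = 2 * p
    return unique_moments - free_params
```

With p = 10 indicators this yields 35 df (an overidentified model), whereas with p = 3 the model is just-identified (0 df).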
In the GGM, the estimation of edges (connections) among nodes is based on algorithms that attempt to optimize a criterion (often the extended BIC, or EBIC) by freeing different edges. To reduce spurious relationships and handle the problem of p >> n, it is common to regularize the network estimation using a LASSO penalty that controls the sparsity of the network. Importantly, both the tuning parameter \(\lambda\) and the hyperparameter \(\gamma\), which controls the additional complexity penalty in the EBIC, are chosen by the researcher. Alternatively, several tuning parameters can be tested in order to identify a sparse network with the lowest EBIC. An important consideration here is that graphical LASSO (GLASSO) approaches are data-driven algorithms for identifying network structure, with no input from the scientist on theoretically plausible parameters or constraints.
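The criterion being optimized can be written down compactly. The helper below is an illustrative sketch following Foygel & Drton's extended BIC for Gaussian graphical models, not the qgraph implementation itself:

```python
import math

def ebic(loglik, n_edges, n, p, gamma=0.5):
    """EBIC = -2*logL + E*log(n) + 4*gamma*E*log(p), where E is the number
    of estimated edges; gamma = 0 recovers the ordinary BIC."""
    return (-2.0 * loglik
            + n_edges * math.log(n)
            + 4.0 * gamma * n_edges * math.log(p))
```

Larger values of \(\gamma\) penalize dense networks more heavily, so the researcher's choice of \(\gamma\) directly shapes how sparse the selected network will be.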
Network structures derived using a GLASSO + EBIC approach are thus typically sparser than the full-rank covariance matrix and are based on the precision matrix (as discussed below). Based on the estimated edges in the network, researchers typically compute vertex (node) or edge (connection) metrics derived from graph theory. For example, the strength of a node is defined as the sum of its connection weights to other nodes (inversely related to the assumed distance between two variables):
\[s_i = \sum_{j}W_{ij}\]
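In code, given a (hypothetical) symmetric weight matrix with a zero diagonal, strength is simply a row sum over edges; a minimal sketch (note that qgraph conventionally sums absolute weights so that negative edges also contribute):

```python
import numpy as np

# Hypothetical symmetric edge weights; the zero diagonal means no self-loops
W = np.array([[0.0, 0.3, 0.2],
              [0.3, 0.0, 0.0],
              [0.2, 0.0, 0.0]])

strength = np.abs(W).sum(axis=1)   # s_i = sum over j of |W_ij|
```

Here the first node has strength 0.5 (two edges), while the others have strength 0.3 and 0.2 (one edge each).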
Such metrics describe the topology of the network, but are not integrally linked to the estimation of the edge structure itself. That is, the metrics are not measures of the likelihood of the parameters given the data and model. Rather, the statistical expectation of the covariance matrix is derived from the GLASSO + EBIC GGM estimation mentioned above and the free parameters are the estimated elements of the precision matrix \(\Theta\). Likewise, in SEM, the parameters of the model jointly inform an estimate of the model-implied covariance matrix \(\Sigma\). Given this structure, we wish to underscore that analyses of graph metrics such as strength or closeness are secondary statistics in that they build upon characteristics of the formal model parameters.
The closest analogy in SEM or traditional statistics would be analyses of the characteristics of free parameters such as factor loadings. For example, one might analyze whether a set of factor loadings are approximately normally distributed.
In secondary statistical analyses of networks and SEMs, it is not usually possible to estimate the uncertainty of a quantity as one would for primary parameters. For example, the strength centrality of a node is the sum of its edge weights, but this sum carries no accompanying estimate of uncertainty. This creates difficulties for statistical analyses of graphs based on the covariance structure of a group insofar as it is difficult to compare differences between nodes, or across groups, in graph structure. As a result, applied graph analyses of psychometric data often report differences in graph metrics with limited ability to adjudicate the strength of evidence. To an extent, this can be circumvented using nonparametric bootstrapping procedures on the graph (Epskamp bootnet paper). By contrast, because the free parameters are typically of substantive interest in SEM and estimation uncertainty is an integral part of maximum likelihood estimation (based on observed Fisher information), it is straightforward to compare parameters or groups statistically.
In a broad sense, statistical models are intended to be a parsimonious representation of the observed data and, ideally, to generalize to the processes that caused the data. Many latent variable analyses seek to represent underlying causative factors that lead to the observed relationships among manifest variables. One formal definition of a latent variable is that it causes the associations among observed variables and that after conditioning on the latent variable, observed variables are independent (conditional independence definition; Lord 1953). Nonformal definitions of latent variables, however, often portray them as “hypothetical constructs” or simply data reduction devices that provide a parsimonious description of observed data (Harman 1960). These diverging views are the basis of critiques of latent variables (e.g., Borsboom 2001), as well as rejoinders that latent variables need not be literally causal (Jonas & Markon 2016). [Expand summary based on Bollen 2002 latent variable review.]
With respect to latent variable and network models, it is crucial to examine the ontology of latent variables and graphical models and to decide to what extent each provides a plausible description of psychometric symptom data (the focus here). In the case of the GGM, there are three important ontological assumptions:
This quote from Hofmann Perspectives review provides a useful summary of the symptom network perspective: “Hence, a disorder is not the underlying cause of symptoms; it constitutes a network of symptoms that interact in ways that tend to maintain themselves. Accordingly, a stressful event does not activate an underlying entity called depression, which then causes the emergence of symptoms. Rather, stressful events activate certain symptoms that, in turn, activate other symptoms, and when the requisite number of symptoms occurs, an episode of disorder may be diagnosable.” (p. 2).
Although one can question the premises of taxonomic research on psychopathology, an effective critique would need to contend with the large literature on defining and measuring constructs in psychiatry. The classic Feighner criteria (Feighner 1972) provided a set of recommendations for describing mental disorders based on the premise that an underlying clinical entity (i.e., a latent variable) causes the observed symptoms. For example, hallucinations and delusions in a given patient are thought to reflect an underlying thought disorder. This ontology has informed a host of taxonomic studies and revisions over the past 40 years, up to and including DSM-5. [Review McNally 2012 PTSD networks paper, which does get into ontology a bit more deeply… though it’s badly flawed].
More generally, psychological research is built on a foundation of developing theories of hypothetical constructs (Cronbach & Meehl). Intelligence is not literally a score on a working memory test; rather, working memory performance is an indicator of some deeper aptitude. Intelligence research is a particularly good example of the latent variable debate in that there is strong evidence of a shared factor (whether causal or descriptive) underlying performance on a host of ability tests, as well as some evidence of more specific abilities (the old Spearman g versus s discussion). Classical test theory is built on the premise that a hypothetical construct is best described by measuring several indicators in order to estimate the true level of the construct. This approach also accounts for how observations can be corrupted by measurement noise while providing an estimate of the trait level after accounting for such noise. Finally, test theory has given rise to a canonical approach to measuring and validating constructs in psychology (Clark & Watson 1995) that is fundamentally built on the premise that there are hypothetical constructs to be measured. SEM can be viewed in part as an extension of classical test theory for validating hypotheses about latent structure.
[Consider reviewing Krueger & Markon comorbidity annual review where they get into different latent->observed mappings across symptoms.]
To overthrow the hypothetical construct approach, a fundamentally different ontology would need to be proposed. Consider a multi-car ‘pile-up’ collision on an icy roadway. This phenomenon can be caused by a chain reaction of individuals trying, but failing, to brake in order to reduce speed and avoid a collision. Such a scenario need not appeal to a latent variable insofar as a set of observable causal conditions is sufficient to explain the collisions: 1) an icy road that minimizes friction, 2) a set of independent heavy objects traveling at speed, and therefore carrying kinetic energy, and 3) braking that cannot do enough work over the available stopping distance d to dissipate that kinetic energy. Such causal dynamics have been appealed to in prior research on cross-sectional psychopathology symptom networks (Fried; Hofmann review), yet the dynamics that lead to observed scores on symptom measures are poorly understood.
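As a rough formalization of this example (our illustrative sketch, with \(\mu\) the road’s friction coefficient, \(m\) the car’s mass, and \(v\) its speed), a collision occurs when the maximum braking work over the available distance \(d\) falls short of the kinetic energy:

\[\mu m g\, d < \tfrac{1}{2} m v^2 \quad\Longleftrightarrow\quad d < \frac{v^2}{2\mu g},\]

so on ice (small \(\mu\)) the required stopping distance grows sharply, and the chain of collisions is explained entirely by observable conditions, with no appeal to a latent variable.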
We note that dynamic network models have an important place in disease research when the mechanisms of transmission are well understood. For example, diseases often spread through physical contact with an infected person, and predicting the rate and geographical spread of infection can help to contain dangerous outbreaks. Epidemiologists have used network models to understand how disease outbreak can be predicted by health-care utilization records in nearby geographical regions (Reis, Kohane, and Mandl, 2007 PLoS Medicine). Likewise, heterogeneity in the pattern and speed of disease outbreaks can be understood by modeling contact networks (Meyers et al., 2005, J Theoretical Biology). Our goal is not to undermine the utility of network models writ large, but to examine how symptom networks based on cross-sectional data can inform an understanding of the structure and taxonomy of psychopathology.
The representation of network analyses based on symptom covariance places manifest variables as the key atomic units of analysis and edges as the central explanatory parameters. [unpack]
[[Aidan: this would be a useful place to have your contributions on the cross-sectional dynamics problems]]
More specifically, a theory of psychopathology based on networks of symptoms would need to provide a plausible account of their associations. Ideally, such a theory would offer a hypothesis of how one symptom causes another, though this might depend on the pathophysiology, which could differ across symptoms. From an ontological perspective, it is not enough that a network approach provide a data-driven account of symptom covariance, even if the fit of such a model exceeds an alternative. This would only be a satisfactory solution if the primary goal of psychometric symptom modeling were to predict external outcomes such as social adjustment or probability of self-injury, for example. Traditionally, psychology has pursued explanatory theories that seek to describe the mechanisms that give rise to behavior (Yarkoni & Westfall, in press, Perspectives on Psyc Science). [Flesh out Yarkoni’s points about why prediction is a valid goal, but also underscore that clinical utility requires a conceptual framework for diagnosis. Also: latent variable approaches are not incompatible with pure prediction, such as a horse race between a network and latent variable structure in prospectively predicting substance abuse relapse.]
Advocates of psychometric network analyses have criticized specific assumptions of the SEM approach that may not be satisfied in empirical data (Borsboom 2001). First, they note that SEM assumes that manifest variables are independent after conditioning on latent variables. If the associations among variables are more complex (e.g., residual associations after conditioning), this violation of conditional independence may render the interpretation of latent variables (e.g., factor scores) invalid. Second, they criticize the idea that symptoms are caused by a latent variable such as a disease state. Criticism of the latent variable premise is largely a reframing of the conditional independence assumption in that network proponents argue that symptoms are conditionally independent after conditioning on each other, not on a hypothetical construct.
The conditional independence assumption is present in both network analyses and latent variable models, including those with categorical latent variables (e.g., latent class models). In network models, as mentioned above, each edge represents the association between two symptoms after conditioning on all other variables in the network. In factor models, one assumes conditional independence of indicators after accounting for the factor loadings. And in latent class models, one assumes independence of indicators after conditioning on class membership. Crucially, however, there is a large literature on relaxing this assumption in latent variable models in order to accommodate more complex structures: for example, measurement invariance [PMI: unpack], residual associations among items [unpack], and exploratory SEM or Bayesian SEM with small positive priors on cross-loadings [unpack].
Indeed, a standard component of model fitting in SEM is to examine modification indices based on the Lagrange multiplier in order to examine whether certain key parameters are omitted from the model. Simply freeing these without conceptual consideration is likely to overfit the data [citation?] and reduce the replicability of the findings. However, a principled consideration of modification indices may reveal plausible and justifiable reasons for adding parameters into the model that were not part of the researcher’s initial hypothesis. [e.g., method variance or content similarity] The broader point is that modification indices are a well-worn tool for detecting violations of conditional independence in latent variable models.
Likewise, an important recommendation in SEM is to fit several alternative models in order to adjudicate the relative evidence of one model over another (Burnham & Anderson, 2002). If the goal of statistical modeling is to provide a parsimonious and reasonable approximation of the observed data, then model selection serves as an important step to consider alternative hypotheses about the data generating processes. In empirical data, these processes are typically unknown and may be rather complex, but by comparing relative evidence of competing models, one can revise an explanatory theory in light of the extent to which the theoretically predicted model best characterizes the data. Likewise, comparing model-predicted versus observed data serves as a check on the quality of fit produced by the model (e.g., residual diagnostics or posterior predictive checks).
[[Peter: do you think that model search algorithms in SEM are worth reviewing as a data-driven counterpart to the EBIC + GLASSO network estimation approach? This seems like one avenue to demonstrate that a search algorithm can largely free relevant covariance parameters in order to approximate the original data, akin to the GGM. A difference would be the use of the precision matrix versus covariance matrix. ]]
A major challenge with network analyses is that the scientist has little control of the parameterization or form of the model. Rather, edges are estimated in the network according to regularization and relative fit criteria. Although simulation studies of this approach suggest that it recovers simulated networks reasonably well (Epskamp Arxiv paper 2016) in terms of sensitivity and specificity, one is not easily able to provide a priori input into the structure of the network or the plausibility of a given edge. For example, based on past theory and research, one may view fatigue and anhedonia as neurovegetative signs of depression and have a strong prior belief that these should be directly connected by an edge. However, to the extent that the association between these items reflects common inputs from other depressive features such as sleep disturbance or sadness, the GGM may select a model lacking the expected edge. Whether this is a ‘feature’ or a ‘bug’ of the method is in the eye of the beholder. In addition, although one can view the EBIC optimization procedure in terms of model selection (comparing networks with different LASSO or complexity penalties), it is not common to compare networks estimated by different methods or procedures to select the best representation. Neither is it common in the symptom network literature to compare the fit of a network model to a plausible latent variable model (cf. Epskamp 2016 Psychometrika, however). In summary, the scientist plays only a minimal role in model building in symptom network analyses, instead interpreting secondary statistics such as nodal centrality measures in order to understand potential explanations for covariance in the data.
Some summary of the basic approach? Bootnet etc.?
The conditional independence assumption in networks permits only unitary direct causes among items. As a result, if responses to an item reflect two or more processes, the item is poorly represented. For example, a fairly developed literature on positively keyed (“I tend to worry a lot”) versus negatively keyed (“I don’t fret life’s troubles”) items indicates that each wording type tends to induce response similarity independent of the construct’s content. Thus, both of the items above could be strong indicators of negative affect, but failing to account for systematic wording effects may lead to misestimation of the latent construct of interest.
Likewise, in a bifactor model, a general factor may account for common variance among many items (e.g., all items on a pain inventory) whereas orthogonal specific factors may capture systematic patterns of covariation in subsets of items (e.g., dull pain versus shooting pain). This structure is not readily represented by a network model (simulation below).
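A small population-level sketch (hypothetical loadings, orthogonal factors; Python/NumPy rather than the manuscript's R) shows why: under a bifactor structure, within-subset correlations exceed between-subset correlations, a layered pattern that a single set of pairwise edges must absorb:

```python
import numpy as np

g  = np.full(6, 0.6)                        # general pain factor loads on all items
s1 = np.array([0.5, 0.5, 0.5, 0.0, 0.0, 0.0])  # specific factor: dull-pain items
s2 = np.array([0.0, 0.0, 0.0, 0.5, 0.5, 0.5])  # specific factor: shooting-pain items

# Implied covariance under orthogonal general and specific factors
Sigma = np.outer(g, g) + np.outer(s1, s1) + np.outer(s2, s2)
np.fill_diagonal(Sigma, 1.0)                # standardized diagonal
```

Within-subset correlations are .61 (general plus specific variance) while between-subset correlations are .36 (general variance only); the network has no parameter that separates these two sources.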
[Method variance, cross-loading, etc.]
[The goal of this section is to a) introduce the formal mapping of latent variable models and network models, and b) suggest the idea that network metrics largely recapitulate parameters already present in SEM.]
[[Peter: here and elsewhere, it would be great to have your input on the link between these models, expanding on your BBS commentary. At present, the psychopathology network modelers are largely selling network models as a new tool that overcomes limitations of SEM. As you trenchantly argue in your commentary, however, modeling covariance in terms of observed variables alone is a rudimentary step that strongly assumes that the manifest variables are causative and that a more parsimonious representation cannot be obtained by considering latent causes.]]
Advocates of the network analysis approach (Hofmann 2015, 2016; McNally 2012) have argued that network analyses provide a perspective on symptom co-occurrence that is relatively free of assumptions of clinical entities or latent traits underlying the observed data. Moreover, this approach has been used to frame the centrality of a given symptom in theoretically meaningful terms. For example, if fatigue has high strength centrality in a network that includes mood and anxiety symptoms, it has been argued that this may reflect the importance of studying fatigue specifically [[Aidan: can you provide a better example from the papers you’ve encountered? This is just off the top of my head and I recall your having some juicy/silly examples]] as a causative factor in internalizing psychopathology. In this way, symptoms have become the focus of study such that symptoms that co-occur with many others may have greater conceptual importance than symptoms with more sparse patterns of co-occurrence.
We assert that network models applied to psychometric data may often recapitulate formal parameters of a related SEM while providing little unique information about the structure of the data. Nodal centrality metrics may serve as useful descriptive statistics to help a scientist develop a better sense of covariation patterns that would be hard to perceive in a large covariance matrix printed in tabular form. We view network analyses of observed items as a useful form of exploratory data analysis that builds on high-dimensional visualization tools (e.g., Emerson et al., 2012, Generalized Pairs Plot). However, such descriptive statistics and visualizations should not be conflated with formal models whose parameters seek to represent the underlying data. [this stuff may belong above in the formal model versus descriptive stats section]
Here, we consider how data generated from a variety of latent factor models (in SEM) are represented using network analyses. We also consider how data generated by a network model (the GGM) are represented from a latent variable perspective. We make three broad assertions about network analyses of data generated by latent variable models:
All simulated data in this section are generated from confirmatory factor models fit within an SEM framework. The basic simulation structure uses standardized factors (variance = 1.0) and equates item residual variances. Data are simulated using the simsem package in R based on lavaan syntax that generates the model-implied mean and covariance structure. The simCFAGraphs function is the main workhorse: it simulates data according to a factor model specification (in particular, the \(\Lambda\), \(\Theta\), and \(\Psi\) matrices) and fits the data using the EBICglasso function from the qgraph package. This follows the recommendations of a recent tutorial paper on generating psychopathology symptom networks (Epskamp: Estimating Psychological Networks and their Accuracy: A Tutorial Paper). Graphs are set up using the igraph package, and centrality measures (strength, closeness, and betweenness at present, to align with this literature) are derived.
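The generative step can be sketched outside of R as well. The following Python/NumPy fragment (an illustrative translation, not the simsem/lavaan code itself; the 0.7 loading is a hypothetical value) builds the model-implied covariance \(\Sigma = \Lambda \Psi \Lambda^{\top} + \Theta\) and samples data from it:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 10
Lam = np.full((p, 1), 0.7)            # Lambda: factor loadings (hypothetical)
Psi = np.eye(1)                       # Psi: standardized factor variance
Theta = np.eye(p) * (1 - 0.7 ** 2)    # Theta: residual variances
Sigma = Lam @ Psi @ Lam.T + Theta     # model-implied covariance matrix

y = rng.multivariate_normal(np.zeros(p), Sigma, size=400)  # N = 400 sample
```

The sample covariances fluctuate around the population value of \(0.7 \times 0.7 = 0.49\), which is what the network estimators in the simulations below take as input.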
H1: In factor models with simple structure, strength centrality (or degree for binarized networks) will be redundant with factor loadings.
The premise of this hypothesis is that variables with higher factor loadings are better representations of the latent variable and will thus tend to be more strongly associated with other indicators of the latent variable. As a result, strength centrality (sum of association strengths) will be strongly associated with the factor loading.
In multi-factor solutions with (reasonably) simple structure, this will hold true as well because a variable that is connected to its primary factor will be associated with other items on the same factor in proportion to its loading.
We simulate data from a one-factor CFA with 10 indicators, 200 replications, N = 400, and factor loadings drawn from a uniform distribution between .40 and .95. We assume equal residual variances of 0.3 (30% unexplained variance). Twenty “examples” are generated: an example, in our shorthand, is a replication of the meta-structure to ensure that results are not peculiar to a given set of replications or loadings.
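H1 can be previewed at the population level with a quick sketch (Python/NumPy; for simplicity this uses marginal correlations rather than the regularized partials estimated in the full simulation, and treats loadings as standardized):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.4, 0.95, size=10)   # loadings as in the simulation design
R = np.outer(lam, lam)                  # implied inter-item correlations
np.fill_diagonal(R, 1.0)

strength = R.sum(axis=1) - 1.0          # sum of each item's associations
r = np.corrcoef(lam, strength)[0, 1]    # near-perfect linear dependence
```

Because each item's summed association is approximately \(\lambda_i \sum_{j \neq i} \lambda_j\), strength is a nearly linear function of the loading, which is exactly the redundancy H1 asserts.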
Means and standard deviations of correlations between factor loadings and graph centrality measures
## Mean and SD of correlations between each nodal measure and factor loadings
corrvgraph_glasso <- do.call(rbind, lapply(dd, function(example) {
  example$graph_v_factor$EBICglasso$corr_v_fitted
}))
corrvgraph_pcor <- do.call(rbind, lapply(dd, function(example) {
  example$graph_v_factor$pcor$corr_v_fitted
}))
kable(corrvgraph_glasso %>% group_by(measure, factor) %>% summarize_all(funs(mean, sd)),
      digits = 2, caption = "Association of GLASSO graph measures with 1-factor CFA loadings")
| measure | factor | mean | sd |
|---|---|---|---|
| betweenness | f1 | 0.70 | 0.08 |
| closeness | f1 | 0.95 | 0.01 |
| strength | f1 | 0.98 | 0.00 |
kable(corrvgraph_pcor %>% group_by(measure, factor) %>% summarize_all(funs(mean, sd)),
      digits = 2, caption = "Association of PCOR graph measures with 1-factor CFA loadings")
| measure | factor | mean | sd |
|---|---|---|---|
| betweenness | f1 | 0.69 | 0.08 |
| closeness | f1 | 0.94 | 0.01 |
| strength | f1 | 0.97 | 0.01 |
smat <- do.call(rbind, lapply(dd, function(rep) {
  filter(rep$graph_v_factor$EBICglasso$metric_v_loadings,
         node == "y1" & measure == "strength" & fittedloading < 1)
}))
## smat has two rows per replication; label them so that spread() works properly
smat$graphNum <- rep(1:(nrow(smat)/2), each = 2)
bb <- smat %>% select(graphNum, value, factor, fittedloading) %>%
  spread(key = factor, value = fittedloading)
g <- ggplot(bb, aes(x = f1, y = value)) + geom_point(alpha = 0.5) +
  stat_smooth(method = "lm") + xlab("Fitted factor loading") +
  ylab("Strength centrality") +
  annotate(geom = "text", x = 0.4, y = 1.1, label = "r = 0.97") +
  theme_cowplot(font_size = 20)
pdf("strength 1-factor loadings.pdf", width = 5, height = 4)
plot(g)
dev.off()
kable(broom::tidy(cor.test(~ fittedloading + value, filter(smat, factor == "f1"))),
      digits = 2, caption = "Association of strength with primary factor loading (f1)")
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.93 | 157.96 | 0 | 3946 | 0.92 | 0.93 | Pearson’s product-moment correlation | two.sided |
cmat <- do.call(rbind, lapply(dd, function(rep) {
  filter(rep$graph_v_factor$EBICglasso$metric_v_loadings,
         node == "y1" & measure == "closeness" & fittedloading < 1)
}))
## cmat has two rows per replication; label them so that spread() works properly
cmat$graphNum <- rep(1:(nrow(cmat)/2), each = 2)
bb <- cmat %>% select(graphNum, value, factor, fittedloading) %>%
  spread(key = factor, value = fittedloading)
g <- ggplot(bb, aes(x = f1, y = value)) + geom_point(alpha = 0.3) +
  stat_smooth(method = "lm") + xlab("Fitted factor loading") +
  ylab("Closeness centrality") +
  annotate(geom = "text", x = 0.4, y = 0.15, label = "r = 0.96") +
  theme_cowplot(font_size = 20)
pdf("closeness 1-factor loadings.pdf", width = 5, height = 4)
plot(g)
dev.off()
Twenty indicators (10 per factor) are simulated with random loadings and equal residual variances as above; other settings are as above.
Note that strength centrality from GLASSO maintains a high correlation with loadings in the two-factor case, whereas strength based on simple partial correlations shows a weaker linear dependence on the loadings.
corrvgraph_glasso <- do.call(rbind, lapply(dd2, function(example) {
  example$graph_v_factor$EBICglasso$corr_v_fitted
}))
corrvgraph_pcor <- do.call(rbind, lapply(dd2, function(example) {
  example$graph_v_factor$pcor$corr_v_fitted
}))
kable(corrvgraph_glasso %>% group_by(measure, factor) %>% summarize_all(funs(mean, sd)),
      digits = 2, caption = "Association of GLASSO graph measures with 2-factor CFA loadings")
| measure | factor | mean | sd |
|---|---|---|---|
| betweenness | f1 | 0.10 | 0.08 |
| betweenness | f2 | 0.11 | 0.07 |
| closeness | f1 | 0.43 | 0.07 |
| closeness | f2 | 0.44 | 0.07 |
| strength | f1 | 0.98 | 0.01 |
| strength | f2 | 0.98 | 0.01 |
kable(corrvgraph_pcor %>% group_by(measure, factor) %>% summarize_all(funs(mean,
sd)), digits = 2, caption = "Association of PCOR graph measures with 2-factor CFA loadings")
| measure | factor | mean | sd |
|---|---|---|---|
| betweenness | f1 | 0.58 | 0.09 |
| betweenness | f2 | 0.59 | 0.09 |
| closeness | f1 | 0.67 | 0.07 |
| closeness | f2 | 0.67 | 0.08 |
| strength | f1 | 0.86 | 0.04 |
| strength | f2 | 0.86 | 0.05 |
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.93 | 161.47 | 0 | 3998 | 0.93 | 0.94 | Pearson’s product-moment correlation | two.sided |
H2: In a 2-factor (or larger) structure, a cross-loading will tend to increase the centrality of that variable in proportion to the joint magnitude of its factor loadings. This highlights a limitation of network analyses: multiple causes of a symptom are not readily accommodated.
For the first simulation, we specify a 2-factor model (9 indicators per factor) with fixed 0.6 loadings. Items y1-y9 load onto f1, and y10-y18 load onto f2. We then randomly vary the magnitude of a cross-loading of y2 onto f2, simulating 500 replications of this structure at N = 400.
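The population model for one replication can be sketched in lavaan syntax. This is a hedged reconstruction from the description above, not the exact generator script; the zero factor correlation, unit factor variances, and residual-variance handling are assumptions.

```r
# Sketch of one replication (assumed design): 2 factors, fixed 0.6
# loadings, plus a randomly varied cross-loading of y2 on f2.
library(lavaan)

set.seed(1)
cross <- runif(1, 0.2, 0.8)  # cross-loading magnitude for this replication
pop_model <- paste0(
    "f1 =~ ", paste(paste0("0.6*y", 1:9), collapse = " + "), "\n",
    "f2 =~ ", paste(paste0("0.6*y", 10:18), collapse = " + "),
    " + ", cross, "*y2\n",
    "f1 ~~ 1*f1\nf2 ~~ 1*f2\nf1 ~~ 0*f2"
)
dat <- simulateData(pop_model, sample.nobs = 400)
```

Each simulated data set would then be passed to both the network estimator and the CFA to obtain centrality metrics and fitted loadings.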
We do not see support for an association of the cross-loading magnitude with node betweenness.
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.01 | 0.38 | 0.7 | 2498 | -0.03 | 0.05 | Pearson’s product-moment correlation | two.sided |
Correlation of betweenness with the fitted secondary factor loading (runif 0.2–0.8)
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.02 | -0.86 | 0.39 | 2498 | -0.06 | 0.02 | Pearson’s product-moment correlation | two.sided |
Joint prediction of betweenness by both loadings
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| f1 | 2.14 | 5.51 | 0.388 | 0.698 |
| f2 | -0.955 | 1.11 | -0.861 | 0.389 |
| (Intercept) | 59.9 | 3.35 | 17.9 | 2.57e-67 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2500 | 10.1 | 0.000355 | -0.000446 |
The magnitude of the cross-loading does strongly influence the closeness centrality of the target node y2.
Correlation of closeness with fitted primary factor loading (loading varies ~0.6)
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.1 | 4.84 | 0 | 2498 | 0.06 | 0.14 | Pearson’s product-moment correlation | two.sided |
Correlation of closeness with fitted secondary factor loading (runif 0.2–0.8)
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.84 | 76.92 | 0 | 2498 | 0.83 | 0.85 | Pearson’s product-moment correlation | two.sided |
Joint prediction of closeness by both loadings
pander(summary(lm(value ~ f1 + f2, bb)))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| f1 | 0.0285 | 0.00344 | 8.29 | 1.81e-16 |
| f2 | 0.0539 | 0.000692 | 77.9 | 0 |
| (Intercept) | 0.0355 | 0.00209 | 17 | 4.15e-61 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2500 | 0.00632 | 0.711 | 0.711 |
Correlation of strength with fitted primary factor loading (loading varies ~0.6)
# correlation of node strength and primary factor loading
broom::tidy(cor.test(~fittedloading + value, filter(smat, factor ==
"f1")))
## estimate statistic p.value parameter conf.low conf.high
## 1 0.1798751 9.139221 1.265938e-19 2498 0.1416714 0.2175438
## method alternative
## 1 Pearson's product-moment correlation two.sided
bb <- smat %>% select(graphNum, value, factor, fittedloading) %>%
spread(key = factor, value = fittedloading) #graphNum contains the identifying column
# ggplot(bb, aes(x=f1, y=value)) + geom_point() +
# stat_smooth()
Correlation of strength with the fitted secondary factor loading (runif 0.2–0.8)
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.86 | 85.95 | 0 | 2498 | 0.85 | 0.87 | Pearson’s product-moment correlation | two.sided |
Joint prediction of strength by both loadings
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| f1 | 0.983 | 0.0539 | 18.3 | 5.59e-70 |
| f2 | 0.991 | 0.0108 | 91.3 | 0 |
| (Intercept) | 0.233 | 0.0328 | 7.1 | 1.6e-12 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2500 | 0.099 | 0.777 | 0.777 |
Here, instead of varying the magnitude of one cross-loading while the other remains fixed, we simulate data along a 2-D grid in which the factor loading of y2 on f1 varies from 0.2–0.8 and its factor loading on f2 also varies from 0.2–0.8. The hypothesis is that centrality measures will be a joint function of the loading magnitudes such that high loadings on both will be associated with high centrality. This is a more powerful demonstration of the point above that network metrics do not allow one to unmix multiple causes.
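The grid layout can be sketched as below; the 0.05 step size and the cell-wise replication scheme are assumptions based on the description above.

```r
# 2-D grid of y2's population loadings on f1 and f2 (step assumed 0.05).
loading_grid <- expand.grid(
    f1_loading = seq(0.2, 0.8, by = 0.05),
    f2_loading = seq(0.2, 0.8, by = 0.05)
)
nrow(loading_grid)  # 13 x 13 = 169 cells
```

Each row would parameterize one population model, with centrality metrics then averaged over replications within the cell.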
Hmisc::rcorr(as.matrix(corr_rand))
## wif1m wif2m bwf1f2m f1y2m f2y2m f1..y2 f2..y2
## wif1m 1.00 0.05 0.06 0.00 0.06 0.03 0.08
## wif2m 0.05 1.00 0.01 -0.16 0.12 -0.14 0.06
## bwf1f2m 0.06 0.01 1.00 0.04 -0.04 -0.03 -0.06
## f1y2m 0.00 -0.16 0.04 1.00 -0.65 0.91 -0.34
## f2y2m 0.06 0.12 -0.04 -0.65 1.00 -0.35 0.90
## f1..y2 0.03 -0.14 -0.03 0.91 -0.35 1.00 0.00
## f2..y2 0.08 0.06 -0.06 -0.34 0.90 0.00 1.00
##
## n= 3380
##
##
## P
## wif1m wif2m bwf1f2m f1y2m f2y2m f1..y2 f2..y2
## wif1m 0.0050 0.0003 0.7714 0.0007 0.0598 0.0000
## wif2m 0.0050 0.5965 0.0000 0.0000 0.0000 0.0004
## bwf1f2m 0.0003 0.5965 0.0230 0.0316 0.0963 0.0010
## f1y2m 0.7714 0.0000 0.0230 0.0000 0.0000 0.0000
## f2y2m 0.0007 0.0000 0.0316 0.0000 0.0000 0.0000
## f1..y2 0.0598 0.0000 0.0963 0.0000 0.0000 0.8475
## f2..y2 0.0000 0.0004 0.0010 0.0000 0.0000 0.8475
# summary(lm(bwf1f2m ~ f1..y2 * f2..y2, corr_rand))
Conclusions about effects of cross-loading on correlations among items:
kable(broom::tidy(cor.test(~fittedloading + value, filter(bmat_rand,
factor == "f2"))), digits = 2, caption = "Correlation of betweenness with f2 fitted loading")
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0 | -0.25 | 0.81 | 3378 | -0.04 | 0.03 | Pearson’s product-moment correlation | two.sided |
kable(broom::tidy(cor.test(~fittedloading + value, filter(bmat_rand,
    factor == "f1"))), digits = 2, caption = "Correlation of betweenness with f1 fitted loading")
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0 | -0.25 | 0.81 | 3378 | -0.04 | 0.03 | Pearson’s product-moment correlation | two.sided |
kable(broom::tidy(cor.test(~fittedloading + value, filter(cmat_rand,
factor == "f2"))), digits = 2, caption = "Correlation of closeness with f2 fitted loading")
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.61 | 45.26 | 0 | 3378 | 0.59 | 0.63 | Pearson’s product-moment correlation | two.sided |
kable(broom::tidy(cor.test(~fittedloading + value, filter(cmat_rand,
factor == "f1"))), digits = 2, caption = "Correlation of closeness with f1 fitted loading")
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.5 | 33.91 | 0 | 3378 | 0.48 | 0.53 | Pearson’s product-moment correlation | two.sided |
kable(broom::tidy(cor.test(~fittedloading + value, filter(smat_rand,
factor == "f2"))), digits = 2, caption = "Correlation of strength with f2 fitted loading")
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.67 | 52.94 | 0 | 3378 | 0.65 | 0.69 | Pearson’s product-moment correlation | two.sided |
kable(broom::tidy(cor.test(~fittedloading + value, filter(smat_rand,
    factor == "f1"))), digits = 2, caption = "Correlation of strength with f1 fitted loading")
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.66 | 51.4 | 0 | 3378 | 0.64 | 0.68 | Pearson’s product-moment correlation | two.sided |
# bb_rand <- smat_rand %>% select(graphNum, value, factor,
# fittedloading) %>% spread(key=factor, value=fittedloading)
bb_rand <- smat_rand %>% select(graphNum, value, factor, poploading) %>%
spread(key = factor, value = poploading) #graphNum contains the identifying column
# ggplot(bb_rand, aes(x=f1, y=value)) + geom_point() +
# stat_smooth()
pander(summary(lm(value ~ f1 + f2, bb_rand))) #strength is strongly predicted by each population loading
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| f1 | 0.987 | 0.00996 | 99.2 | 0 |
| f2 | 1.01 | 0.00996 | 102 | 0 |
| (Intercept) | 0.203 | 0.00728 | 27.8 | 1.62e-153 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 3380 | 0.108 | 0.856 | 0.856 |
# aggregate replications within each combination of f1 and
# f2loading
bbagg <- bb_rand %>% group_by(f1, f2) %>% summarize(value = mean(value))
# 3d plot of sorts
pdf("strength f1f2.pdf", width = 7, height = 5)
ggplot(bbagg, aes(x = f1, y = f2, fill = value)) + geom_tile() +
# scale_color_gradient() +
scale_fill_viridis("Strength", option = "viridis") + xlab("Factor 1 loading") +
ylab("Factor 2 loading") + theme_cowplot(font_size = 20)
# ggtitle('Association of f1 (x) and f2 (y) loadings with
# strength (color)')
dev.off()
# geom_jitter(size=3, position = position_jitter(width=0.03,
# height=0.03)) +
Simulate residual correlations on a 2-D grid of cross-loadings for four items: y1 and y2 (f1); y10 and y11 (f2). Residual correlations range from 0.0 to 0.8 in 0.05 increments while holding the primary factor loadings constant at 0.6.
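One cell of this design can be sketched in lavaan syntax. This is hedged: only a single residual correlation (y1 with y10) is shown, the item pairing and sample size are assumptions, and the residual variances are fixed at 0.3 to match the standardization note in the code comments of this section.

```r
# Sketch: two orthogonal factors, fixed 0.6 loadings, residual variances
# 0.3, and one residual correlation taken from the 0.0-0.8 grid.
library(lavaan)

set.seed(1)
r <- 0.4  # one grid value; the grid runs 0.0-0.8 in 0.05 increments
pop_model <- paste0(
    "f1 =~ ", paste(paste0("0.6*y", 1:9), collapse = " + "), "\n",
    "f2 =~ ", paste(paste0("0.6*y", 10:18), collapse = " + "), "\n",
    "f1 ~~ 1*f1\nf2 ~~ 1*f2\nf1 ~~ 0*f2\n",
    paste(paste0("y", 1:18, " ~~ 0.3*y", 1:18), collapse = "\n"), "\n",
    # residual covariance enters theta as r * sqrt(V1) * sqrt(V2) = 0.3 * r
    "y1 ~~ ", r * 0.3, "*y10"
)
dat <- simulateData(pop_model, sample.nobs = 400)
```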
smat_rand <- do.call(bind_rows, lapply(mlist_crossload, function(rep) {
    # keep only the items that were allowed residual correlations
    filter(rep$graph_v_factor$EBICglasso$metric_v_loadings,
        node %in% c("y1", "y2", "y10", "y11") & measure == "closeness")
})) #bind_rows from dplyr automatically fills NAs (needed here since 0, 0 is one cell of the cross-loading grid)
pander(summary(lm(value ~ fittedloading, filter(smat_rand, node ==
"y1"))))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| fittedloading | 0.00453 | 0.0181 | 0.251 | 0.802 |
| (Intercept) | 0.0698 | 0.0108 | 6.44 | 1.58e-10 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1445 | 0.0233 | 4.37e-05 | -0.000649 |
# pander(summary(lm(value ~ rcorr_pop, filter(smat_rand,
# node=='y1'))))
pander(summary(lm(value ~ rcorr_fitted, filter(smat_rand, node ==
"y1"))))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| rcorr_fitted | 0.266 | 0.0049 | 54.3 | 0 |
| (Intercept) | 0.0401 | 0.000714 | 56.2 | 0 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1360 | 0.0129 | 0.685 | 0.684 |
pander(summary(lm(value ~ rcorr_fitted + I(rcorr_fitted^2), filter(smat_rand,
node == "y1"))))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| rcorr_fitted | 0.281 | 0.0187 | 15 | 3.59e-47 |
| I(rcorr_fitted^2) | -0.0578 | 0.0708 | -0.816 | 0.415 |
| (Intercept) | 0.0395 | 0.00107 | 37 | 1.91e-207 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1360 | 0.0129 | 0.685 | 0.684 |
pander(summary(lm(value ~ rcorr_fitted * fittedloading, filter(smat_rand,
node == "y1"))))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| rcorr_fitted | 0.267 | 0.0922 | 2.9 | 0.00378 |
| fittedloading | 0.0102 | 0.0207 | 0.494 | 0.621 |
| rcorr_fitted:fittedloading | -0.0019 | 0.154 | -0.0124 | 0.99 |
| (Intercept) | 0.034 | 0.0124 | 2.74 | 0.00625 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1360 | 0.0129 | 0.685 | 0.684 |
# Residual correlations are added to theta as r * sqrt(V1) * sqrt(V2) to
# respect the residual variances of the indicators. To scale back onto a
# correlation metric, we undo this by standardizing: b / [sqrt(V1) * sqrt(V2)].
# Here both residual variances are 0.3, so we simply divide by 0.3.
# Technically we should use the fitted variances to get this exactly right,
# but the difference is trivial. Check: range(smat_rand$rcorr_pop/0.3, na.rm=T)
ggplot(smat_rand, aes(x = rcorr_fitted/0.3, y = value, color = fittedloading)) +
    facet_wrap(~node) + geom_point() + ggtitle("Association of estimated residual correlation (x) with closeness centrality (y)") +
    xlab("Estimated residual correlation of variable")
## Warning: Removed 340 rows containing missing values (geom_point).
smat_rand$graphNum <- rep(1:(nrow(smat_rand)/2), each = 2) #smat has two rows per replication. Need to identify these to get the spread to work properly
# bb_rand <- smat_rand %>% select(graphNum, value, factor,
# poploading) %>% spread(key=factor, value=poploading)
# #graphNum contains the identifying column ggplot(bb_rand,
# aes(x=f1, y=value)) + geom_point() + stat_smooth()
# pander(summary(lm(value ~ f1 + f2, bb_rand))) #so measure
# becomes hugely predicted by each loading
### TODO: should probably reshape to allow for MLM where nodes are random.
### Initial corroboration for hypothesis 2:
# mm <- do.call(rbind, mlist)
# my1 <- filter(mm, node == "y1")
# summary(lm(value ~ y10, my1))  # betweenness of y1 predicted by residual correlation with y10
# cor.test(~ value + y10, my1)   # r ~ 0.55
# my10 <- filter(mm, node == "y10")
# summary(lm(value ~ y1, my10))  # betweenness of y10 predicted by residual correlation with y1
# cor.test(~ value + y1, my10)
An important consideration unexplored in these initial simulations is the characterization of subgraph structure, often achieved by parcelling a graph into connected subcomponents. In graph analysis, this is the problem of community detection, for which several algorithms exist. Notably, these algorithms are largely in line with the goals of factor analysis. Regardless, in graphs with moderate to high modularity, within- and between-module metrics should not be confounded with aggregate whole-graph statistics.
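To make the parallel concrete, a toy sketch of community detection on a weighted network; igraph is assumed to be available, and the adjacency values are made up to create two modules.

```r
# Toy weighted adjacency with two dense 4-node blocks and weak
# between-block edges (illustrative values only).
library(igraph)

set.seed(1)
A <- matrix(0.05, 8, 8)
A[1:4, 1:4] <- 0.4
A[5:8, 5:8] <- 0.4
diag(A) <- 0

g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
comm <- cluster_walktrap(g)  # one of several community-detection algorithms
membership(comm)             # module assignment per node, akin to factor assignment
```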
Bifactor example
Disproportionate influence of residual associations in the presence of a positive manifold. This is largely a problem with the partial correlation/precision approach.
Considerations of subgraph structure. Centrality metrics are often reported blind to subgraph structure, whereas network experts rarely overlook modularity, connected components, and so on. This again parallels the relationship between factor models and community detection.
So, we need to parametrically manipulate the amount of variation explained by a latent variable versus a unique bivariate association.
The cleanest example might be a 2-factor structure with a single residual association across factors. Does the relative magnitude of this association scale with the strength of the GLASSO edge?
Or maybe even a 1-factor structure with a residual association. Does it look dramatically different from a one-factor model?
On further reflection, varying the cross-loading magnitude in terms of percent variance is a useful but redundant extension of the cross-loading example above (i.e., that multiple causes cannot be accommodated). What we really need is a scenario that pits true shared variance against a unique cause.
I think this is probably cleanest through a two-factor CFA with an additional factor that scales the unique association of two items.
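A minimal lavaan sketch of that idea follows. The specific loading values, item assignments, and orthogonality constraints are illustrative assumptions; the grid would trade `f_loading` against `u_loading` across cells.

```r
# Sketch: two orthogonal factors plus a unique factor u that carries the
# y2-y11 association; f_loading and u_loading trade off across the grid.
library(lavaan)

set.seed(1)
f_loading <- 0.56
u_loading <- 0.57
pop_model <- paste0(
    "f1 =~ ", paste(paste0("0.6*y", c(1, 3:9)), collapse = " + "),
    " + ", f_loading, "*y2\n",
    "f2 =~ ", paste(paste0("0.6*y", c(10, 12:18)), collapse = " + "),
    " + ", f_loading, "*y11\n",
    "u =~ ", u_loading, "*y2 + ", u_loading, "*y11\n",
    "f1 ~~ 1*f1\nf2 ~~ 1*f2\nu ~~ 1*u\n",
    "f1 ~~ 0*f2\nf1 ~~ 0*u\nf2 ~~ 0*u"
)
dat <- simulateData(pop_model, sample.nobs = 400)
```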
TODO: What happens if factors start to correlate? Do we start to overvalue cross-factor associations over within-factor associations? This would be the broadband prediction of the partial correlation failure notion.
y2eff <- do.call(rbind, lapply(mlist_uniq, function(cellreps) {
filter(cellreps$graph_v_factor$EBICglasso$metric_v_loadings,
node == "y2")
}))
cor.test(~value + poploading, subset(y2eff, factor == "f3" &
measure == "betweenness"))
##
## Pearson's product-moment correlation
##
## data: value and poploading
## t = 12.963, df = 1297, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2896295 0.3859639
## sample estimates:
## cor
## 0.338684
cor.test(~value + poploading, subset(y2eff, factor == "f1" &
measure == "betweenness"))
##
## Pearson's product-moment correlation
##
## data: value and poploading
## t = -6.1065, df = 1277, p-value = 1.349e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2212129 -0.1146873
## sample estimates:
## cor
## -0.1684419
plot(~value + poploading, subset(y2eff, factor == "f3" & measure ==
"strength"))
plot(~value + poploading, subset(y2eff, factor == "f1" & measure ==
"strength"))
cor.test(~value + poploading, subset(y2eff, factor == "f3" &
measure == "strength"))
##
## Pearson's product-moment correlation
##
## data: value and poploading
## t = -15.59, df = 1297, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4421084 -0.3504543
## sample estimates:
## cor
## -0.3972716
cor.test(~value + poploading, subset(y2eff, factor == "f1" &
measure == "strength"))
##
## Pearson's product-moment correlation
##
## data: value and poploading
## t = 18.276, df = 1277, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4107703 0.4977228
## sample estimates:
## cor
## 0.4553316
cor.test(~value + poploading, subset(y2eff, factor == "f3" &
measure == "closeness"))
##
## Pearson's product-moment correlation
##
## data: value and poploading
## t = 9.3616, df = 1297, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1999280 0.3018415
## sample estimates:
## cor
## 0.2515821
cor.test(~value + poploading, subset(y2eff, factor == "f1" &
measure == "closeness"))
##
## Pearson's product-moment correlation
##
## data: value and poploading
## t = -0.75912, df = 1277, p-value = 0.4479
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07596322 0.03361441
## sample estimates:
## cor
## -0.02123819
plot(~value + poploading, subset(y2eff, factor == "f3" & measure ==
"closeness")) #insanely quadratic
plot(~value + poploading, subset(y2eff, factor == "f1" & measure ==
"closeness")) #insanely quadratic
# temporarily: the quadratic shapes here show that as an item
# cross-loads on two factors, it becomes highly central in
# the network
# try plotting a network at low, medium, and high levels of
# f3
qgraph(mlist_uniq[[1]]$adjmats$EBICglasso$average) #0.0 loading on f1, 0.80 loading on u (so y2 and y11 are part of u and not f1)
qgraph(mlist_uniq[[32]]$adjmats$EBICglasso$average) #0.56 loading on f1, 0.57 loading on u (about equal for y2 and y11)
qgraph(mlist_uniq[[65]]$adjmats$EBICglasso$average) #0.8 loading on f1, ~0 loading on u (so y2 and y11 are part of f1 and not u)
y2y11_edge <- sapply(mlist_uniq, function(cell) {
cell$adjmats$EBICglasso$average["y2", "y11"]
})
loading_grid$uedge <- y2y11_edge
cor.test(~uedge + f_loading, loading_grid)
##
## Pearson's product-moment correlation
##
## data: uedge and f_loading
## t = -32.51, df = 63, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9825574 -0.9534835
## sample estimates:
## cor
## -0.9714648
cor.test(~uedge + u_loading, loading_grid)
##
## Pearson's product-moment correlation
##
## data: uedge and u_loading
## t = 28.125, df = 63, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9388998 0.9769814
## sample estimates:
## cor
## 0.9624101
plot(~uedge + I(f_loading^2), loading_grid)
plot(~uedge + I(u_loading^2), loading_grid)